Boolean query construction is often critical for medical systematic review literature search. To create an effective Boolean query, systematic review researchers typically spend weeks devising effective query terms and combinations. One challenge in creating an effective systematic review Boolean query is the selection of effective MeSH terms to include in the query. In our previous work, we created neural MeSH term suggestion methods and compared them to state-of-the-art MeSH term suggestion methods, finding the neural methods to be highly effective. In this demonstration, we build upon our previous work by creating (1) a Web-based MeSH term suggestion prototype system that allows users to obtain suggestions from a number of underlying methods and (2) a Python library that implements our and others' MeSH term suggestion methods, aimed at researchers who want to further investigate, create, or deploy such methods. We describe the architecture of the web-based system and how to use it for the MeSH term suggestion task. For the Python library, we describe how it can be used to advance further research and experimentation, and we validate the results of the methods contained in the library on standard datasets. Our web-based prototype system is available at http://ielab-mesh-suggest.uqcloud.net, while our Python library is at https://github.com/ielab/meshsuggestlib.
Medical systematic reviews typically require assessing all the documents retrieved by a search. The reason is two-fold: the task aims for ``total recall''; and documents retrieved using Boolean search form an unordered set, so it is unclear how an assessor could examine only a subset. Screening prioritisation is the process of ranking the (unordered) set of retrieved documents, allowing assessors to begin the downstream processes of systematic review creation earlier, leading to earlier completion of the review, or even avoiding the screening of documents ranked least relevant. Screening prioritisation requires highly effective ranking methods. Pre-trained language models are state-of-the-art on many IR tasks but have yet to be applied to systematic review screening prioritisation. In this paper, we apply several pre-trained language models to the systematic review document ranking task, both directly and fine-tuned. An empirical analysis compares the effectiveness of neural methods to that of traditional methods for this task. We also investigate different types of document representations for neural methods and their impact on ranking performance. Our results show that BERT-based rankers outperform the current state-of-the-art screening prioritisation methods. However, BERT rankers and existing methods can actually be complementary, and thus further improvements may be achieved if they are used in conjunction.
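The complementarity finding suggests fusing the neural and traditional rankings. A minimal sketch, assuming reciprocal rank fusion (RRF) as the combination method and made-up document ids; the abstract does not prescribe a specific fusion technique, so this is illustrative only:

```python
# Toy sketch: combine two complementary rankings via reciprocal rank fusion.
def rrf_fuse(rankings, k=60):
    """Fuse several rankings (lists of doc ids, best first) into one."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A neural (e.g. BERT-based) ranking and a traditional ranking of the
# same retrieved set (document ids are invented):
neural = ["d3", "d1", "d2", "d4"]
traditional = ["d3", "d2", "d1", "d4"]

fused = rrf_fuse([neural, traditional])
```

Because "d3" is ranked first by both methods, it also tops the fused ranking; documents the two methods disagree on receive intermediate fused scores.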
Entity Alignment (EA), which aims to detect entity mappings (i.e. equivalent entity pairs) across different Knowledge Graphs (KGs), is critical for KG fusion. Neural EA methods dominate current EA research but still suffer from their reliance on labelled mappings. To address this problem, a few works have explored boosting the training of EA models with self-training, which iteratively adds confidently predicted mappings to the training data. Though the effectiveness of self-training can be glimpsed in some specific settings, we still have very limited knowledge about it. One reason is that existing works concentrate on devising EA models and only treat self-training as an auxiliary tool. To fill this knowledge gap, we shift the perspective to self-training itself to shed light on it. In addition, existing self-training strategies have limited impact because they introduce either substantial False Positive noise or a low quantity of True Positive pseudo mappings. To improve self-training for EA, we propose exploiting the dependencies between entities, a particularity of EA, to suppress the noise without hurting the recall of True Positive mappings. Through extensive experiments, we show that the introduction of dependency takes the self-training strategy for EA to a new level. The value of self-training in alleviating the reliance on annotation is actually much higher than what has been realised. Furthermore, we suggest future studies on smart data annotation to break the ceiling of EA performance.
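The self-training loop described above can be sketched in miniature. The "model" below is a stub similarity table, and the one-to-one filter is a simple stand-in for the dependency-based noise suppression the paper proposes; entity names and scores are invented:

```python
# Toy self-training loop for entity alignment: confidently predicted
# one-to-one pairs are added to the training set each round.
def predict(sim, trained):
    """Return (score, src, tgt) candidates not already in the training set."""
    return sorted(
        ((s, a, b) for (a, b), s in sim.items() if (a, b) not in trained),
        reverse=True,
    )

def self_train(sim, seed, threshold=0.8, rounds=3):
    trained = set(seed)
    for _ in range(rounds):
        used_src = {a for a, _ in trained}
        used_tgt = {b for _, b in trained}
        for score, a, b in predict(sim, trained):
            # Keep only confident pairs that respect a one-to-one constraint,
            # suppressing False Positive noise such as ("e2", "f1") below.
            if score >= threshold and a not in used_src and b not in used_tgt:
                trained.add((a, b))
                used_src.add(a)
                used_tgt.add(b)
    return trained

sim = {("e1", "f1"): 0.95, ("e2", "f2"): 0.85, ("e2", "f1"): 0.9, ("e3", "f3"): 0.4}
mappings = self_train(sim, seed=[("e1", "f1")])
```

Here the confident but conflicting pair ("e2", "f1") is rejected because "f1" is already claimed by the seed mapping, while ("e2", "f2") is accepted.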
Entity Alignment (EA) aims to find equivalent entities between two Knowledge Graphs (KGs). While numerous neural EA models have been devised, they are mainly learned using labelled data only. In this work, we argue that different entities within one KG should have compatible counterparts in the other KG due to the potential dependencies among the entities. Making compatible predictions should thus be one of the goals of training an EA model, along with fitting the labelled data; this aspect, however, is neglected in current methods. To power neural EA models with compatibility, we devise a training framework that addresses three problems: (1) how to measure the compatibility of an EA model; (2) how to inject the property of being compatible into an EA model; and (3) how to optimise the parameters of the compatibility model. Extensive experiments on widely-used datasets demonstrate the advantages of integrating compatibility within EA models. In fact, state-of-the-art neural EA models trained within our framework using just 5% of the labelled data can achieve effectiveness comparable to supervised training using 20% of the labelled data.
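To make problem (1) concrete, here is one possible toy compatibility measure: the fraction of predicted mappings that do not clash with another prediction on either side, since equivalence should be one-to-one. This is an illustrative stand-in with invented entity names, not the measure defined in the paper:

```python
# Toy compatibility score for a set of predicted EA mappings.
from collections import Counter

def compatibility(predicted_mappings):
    src_counts = Counter(a for a, _ in predicted_mappings)
    tgt_counts = Counter(b for _, b in predicted_mappings)
    # A mapping is "compatible" if neither of its entities is claimed twice.
    ok = [
        (a, b)
        for a, b in predicted_mappings
        if src_counts[a] == 1 and tgt_counts[b] == 1
    ]
    return len(ok) / len(predicted_mappings)

preds = [("e1", "f1"), ("e2", "f2"), ("e3", "f2")]  # "f2" is claimed twice
score = compatibility(preds)
```

Only ("e1", "f1") survives the one-to-one check, so the score is 1/3; a training framework could penalise a model in proportion to such clashes.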
High-quality medical systematic reviews require comprehensive literature searches to ensure that recommendations and outcomes are sufficiently reliable. Indeed, searching for relevant medical literature is a key phase in constructing systematic reviews, and it often involves domain experts (medical researchers) and search experts (information specialists) in developing the search queries. Queries in this context are highly complex and based on Boolean logic, include free-text terms and index terms from standardised terminologies (e.g., the Medical Subject Headings (MeSH) thesaurus), and are difficult to build. The use of MeSH terms, in particular, has been shown to improve the quality of search results. However, identifying the correct MeSH terms to include in a query is difficult: information specialists are often unfamiliar with the MeSH database and unsure about the appropriateness of MeSH terms for a query. Naturally, the full value of the MeSH terminology is often not fully exploited. This paper investigates methods for suggesting MeSH terms based on an initial Boolean query containing only free-text terms. In this context, we devise lexical and pre-trained language model based methods. These methods promise to automatically identify highly effective MeSH terms for inclusion in a systematic review query. Our study contributes an empirical evaluation of several MeSH term suggestion methods. We further provide an extensive analysis of the MeSH term suggestions of each method and how these suggestions impact the effectiveness of Boolean queries.
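The lexical family of suggestion methods can be illustrated in miniature: rank entries of a (tiny, made-up) MeSH vocabulary by token overlap with the free-text terms of a Boolean query. The paper's methods also include pre-trained language model rankers; this sketch only shows the lexical idea:

```python
# Toy lexical MeSH term suggester based on token overlap.
MESH = ["Diabetes Mellitus, Type 2", "Insulin Resistance", "Hypertension"]

def tokens(text):
    return {t.strip(",").lower() for t in text.split()}

def suggest(free_text_terms, vocabulary=MESH, k=2):
    query = set().union(*(tokens(t) for t in free_text_terms))
    # Score each MeSH entry by how many of its tokens appear in the query.
    scored = [(len(tokens(m) & query), m) for m in vocabulary]
    return [m for s, m in sorted(scored, reverse=True) if s > 0][:k]

suggestions = suggest(["type 2 diabetes", "insulin therapy"])
```

For the free-text atoms "type 2 diabetes" and "insulin therapy", the sketch suggests "Diabetes Mellitus, Type 2" first (three overlapping tokens), then "Insulin Resistance" (one).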
Entity Alignment (EA) aims to match equivalent entities that refer to the same real-world objects and is a key step in Knowledge Graph (KG) fusion. Most neural EA models cannot be applied to large real-life KGs due to their excessive consumption of GPU memory and time. A promising solution is to divide a large EA task into several subtasks so that each subtask only needs to match two small subgraphs of the original KGs. However, dividing an EA task without losing effectiveness is challenging. Existing methods exhibit low coverage of potential mappings, insufficient evidence in context graphs, and widely varying subtask sizes. In this work, we design a partitioning framework for large-scale EA with high-quality task division. To include in the EA subtasks a large proportion of the potential mappings originally present in the large EA task, we devise a counterpart discovery method that exploits the locality principle of the EA task and the power of a trained EA model. Unique to our counterpart discovery method is the explicit modelling of the chance of a potential mapping. We also introduce an evidence passing mechanism to quantify the informativeness of context entities and to find the most informative context graphs, with flexible control over the subtask size. Extensive experiments show that, compared with alternative state-of-the-art solutions, partitioning with our framework yields higher EA performance.
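The partitioning idea can be sketched at toy scale: split the source entities into fixed-size blocks, then for each block select the target entities most similar to any entity in the block, so that likely counterparts land in the same subtask. The similarity scores and entity names below are invented, and this omits the paper's evidence-passing and chance-modelling components entirely:

```python
# Toy partition of an EA task into (source block, candidate targets) subtasks.
def make_subtasks(src_entities, sim, block_size=2, targets_per_block=2):
    subtasks = []
    for i in range(0, len(src_entities), block_size):
        block = src_entities[i : i + block_size]
        # Score each target entity by its best similarity to the block.
        best = {}
        for (a, b), s in sim.items():
            if a in block:
                best[b] = max(best.get(b, 0.0), s)
        targets = sorted(best, key=best.get, reverse=True)[:targets_per_block]
        subtasks.append((block, targets))
    return subtasks

sim = {("e1", "f1"): 0.9, ("e2", "f2"): 0.8, ("e3", "f3"): 0.7, ("e1", "f3"): 0.2}
subtasks = make_subtasks(["e1", "e2", "e3"], sim)
```

Each subtask now pairs a small source block with a small candidate target set, so an EA model only ever matches two small subgraphs at a time.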
The persistent topological properties of an image are an additional descriptor that provides insights traditional neural networks may fail to discover. Existing research in this area has mainly focused on efficiently integrating the topological properties of the data into the learning process to enhance performance. However, no existing study covers all the possible scenarios in which introducing topological properties can boost or harm performance. This paper performs a detailed analysis of the effectiveness of topological properties for image classification in various training scenarios, defined by: the number of training samples, the complexity of the training data, and the complexity of the backbone network. We identify the scenarios that benefit the most from topological features, e.g., training simple networks on small datasets. Additionally, we discuss the problem of the topological consistency of datasets, which is one of the major bottlenecks for using topological features for classification. We further demonstrate how topological inconsistency can harm performance in certain scenarios.
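The persistence computation underlying such features can be sketched in one dimension. The union-find routine below computes 0-dimensional sublevel-set persistence pairs of a 1D signal; it is a toy stand-in for the cubical-complex filtrations typically built on images:

```python
# Toy 0-dimensional sublevel-set persistence of a 1D signal via union-find.
def persistence_0d(values):
    """Return (birth, death) pairs; the global minimum pairs with infinity."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    parent = [None] * n  # None means the cell has not entered the filtration

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    for i in order:
        parent[i] = i
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # The younger component (larger birth value) dies at values[i].
                if values[ri] < values[rj]:
                    ri, rj = rj, ri
                if values[ri] < values[i]:  # skip zero-persistence pairs
                    pairs.append((values[ri], values[i]))
                parent[ri] = rj
    # The component of the global minimum never dies.
    pairs.append((values[order[0]], float("inf")))
    return pairs

diagram = persistence_0d([0.0, 3.0, 1.0, 4.0])
```

For the signal above, the local minimum at value 1.0 is born as its own component and merges into the deeper one at 3.0, giving the finite pair (1.0, 3.0); such (birth, death) diagrams are what gets vectorised into topological features.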
The Differentiable Search Index (DSI) is a new, emerging paradigm for information retrieval. Unlike traditional retrieval architectures, where indexing and retrieval are two different and separate components, DSI uses a single transformer model to perform both indexing and retrieval. In this paper, we identify and address an important issue of current DSI models: the data distribution mismatch that occurs between the DSI indexing and retrieval processes. Specifically, we argue that at indexing time, current DSI methods learn to build connections between the long text of a document and its identifier, but at retrieval time the DSI model is provided with short query texts from which to retrieve document identifiers. This problem is further exacerbated when using DSI for cross-lingual retrieval, where document text and query text are in different languages. To address this fundamental problem of current DSI models, we propose a simple yet effective indexing framework for DSI, called DSI-QG. In DSI-QG, documents are represented by a number of relevant queries generated by a query generation model at indexing time. This allows the DSI model to connect a document identifier to a set of query texts when indexing, hence mitigating the data distribution mismatch between the indexing and retrieval phases. Empirical results on popular monolingual and cross-lingual passage retrieval benchmark datasets show that DSI-QG significantly outperforms the original DSI model.
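The indexing-side change can be sketched as follows: instead of (document text → docid) training pairs, DSI-QG builds (generated query → docid) pairs, so the input distribution at indexing matches the short queries seen at retrieval. The query "generator" below is a stub; the paper uses a trained query generation model:

```python
# Toy construction of DSI-QG-style (generated query -> docid) training pairs.
def generate_queries(doc_text, n=2):
    """Stand-in for a trained query generation model."""
    words = doc_text.split()
    return [" ".join(words[i : i + 3]) for i in range(min(n, len(words)))]

def build_indexing_pairs(corpus):
    pairs = []
    for docid, text in corpus.items():
        for q in generate_queries(text):
            pairs.append((q, docid))  # short query text -> document identifier
    return pairs

corpus = {"doc1": "differentiable search index for retrieval"}
pairs = build_indexing_pairs(corpus)
```

Each document now contributes several short, query-like training inputs for its identifier; in the cross-lingual setting the generator would emit queries in the target query language.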
Algorithms that involve both forecasting and optimization are at the core of solutions to many difficult real-world problems, such as supply chains (inventory optimization), traffic, and the transition towards carbon-free energy generation through battery/load/production scheduling in sustainable energy systems. Typically, in these scenarios we want to solve an optimization problem that depends on unknown future values, which therefore need to be forecast. As both forecasting and optimization are difficult problems in their own right, relatively little research has been done in this area. This paper presents the findings of the "IEEE-CIS Technical Challenge on Predict+Optimize for Renewable Energy Scheduling," held in 2021. We present a comparison and evaluation of the seven highest-ranked solutions in the competition, to provide researchers with a benchmark problem and to establish the state of the art for this benchmark, with the aim of fostering and facilitating research in this area. The competition used data from the Monash Microgrid, as well as weather data and energy market data. It focused on two main challenges: forecasting renewable energy production and demand, and obtaining an optimal schedule for the activities (lectures) and on-site batteries that leads to the lowest cost of energy. The most accurate forecasts were obtained by gradient-boosted tree and random forest models, and optimization was mostly performed using mixed integer linear and quadratic programming. The winning method predicted different scenarios and optimized over all scenarios jointly using a sample average approximation method.
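The winning strategy of optimizing jointly over forecast scenarios can be sketched at toy scale: pick one schedule (here, a single hour in which to charge a battery) that minimizes the cost averaged over several price scenarios, rather than the cost under a single point forecast. All prices below are invented for illustration:

```python
# Toy sample average approximation (SAA) over forecast scenarios.
def saa_schedule(price_scenarios):
    hours = range(len(price_scenarios[0]))
    # Average each hour's price across all scenarios, then pick the cheapest.
    avg = [sum(s[h] for s in price_scenarios) / len(price_scenarios) for h in hours]
    return min(hours, key=lambda h: avg[h])

scenarios = [
    [30.0, 12.0, 25.0],  # three hourly price forecasts, scenario 1
    [28.0, 18.0, 10.0],  # scenario 2
    [35.0, 11.0, 40.0],  # scenario 3
]
best_hour = saa_schedule(scenarios)
```

Note that hour 2 is cheapest in scenario 2 but very expensive in scenario 3; averaging over scenarios selects hour 1, which is robustly cheap, illustrating why joint optimization can beat optimizing against any single forecast.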
Large machine learning models with improved predictions have become widely available in the chemical sciences. Unfortunately, these models do not provide the privacy protection necessary in commercial settings, prohibiting the use of potentially extremely valuable data by others. Encrypting the prediction process can solve this problem by enabling double-blind model evaluation and preventing the extraction of training or query data. However, contemporary ML models based on fully homomorphic encryption or federated learning are either too expensive for practical use or have to trade higher speed for weaker security. We have implemented secure and computationally feasible encrypted machine learning models using oblivious transfer, enabling secure predictions of molecular quantum properties across chemical compound space. However, we find that encrypted predictions using kernel ridge regression models are a million times more expensive than unencrypted ones. This demonstrates a dire need for compact machine learning model architectures, including molecular representation and kernel matrix size, that minimize model evaluation costs.
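The cost argument can be made concrete: a kernel ridge regression prediction requires one kernel evaluation per training point, and under encryption each of those evaluations becomes a costly protocol step, so evaluation cost grows directly with the kernel matrix size. The sketch below is an unencrypted toy with invented, assumed pre-fitted coefficients:

```python
# Toy kernel ridge regression prediction: O(len(train_x)) kernel evaluations,
# each of which would be an expensive operation under encryption.
import math

def rbf(x, y, gamma=0.5):
    return math.exp(-gamma * (x - y) ** 2)

def krr_predict(x, train_x, alphas, kernel=rbf):
    # One kernel evaluation per training point.
    return sum(a * kernel(x, xi) for a, xi in zip(alphas, train_x))

train_x = [0.0, 1.0, 2.0]
alphas = [0.2, 0.5, 0.3]  # assumed pre-fitted dual coefficients
y = krr_predict(1.0, train_x, alphas)
```

Shrinking the representation or the training set shortens this sum term by term, which is exactly why compact architectures matter for encrypted evaluation.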